{"id":38328,"date":"2011-05-04T08:05:40","date_gmt":"2011-05-04T08:05:40","guid":{"rendered":"https:\/\/www.vmengine.net\/2011\/05\/04\/datacenter-failure-downtime-in-not-an-option\/"},"modified":"2025-05-23T17:09:34","modified_gmt":"2025-05-23T17:09:34","slug":"datacenter-failure-downtime-in-not-an-option","status":"publish","type":"post","link":"http:\/\/temp_new.vmenginelab.com\/en\/2011\/05\/04\/datacenter-failure-downtime-in-not-an-option\/","title":{"rendered":"DataCenter Failure: downtime is not an option"},"content":{"rendered":"<p><!--:it--><a href=\"http:\/\/temp_new.vmenginelab.com\/wp-content\/uploads\/2011\/04\/dilbert-downtime1-2.jpg\"><img fetchpriority=\"high\" decoding=\"async\" class=\"size-medium wp-image-1165 alignleft\" style=\"margin-left: 10px; margin-right: 10px;\" title=\"Downtime\" src=\"http:\/\/blog.vmengine.net\/vmengineblog\/wp-content\/uploads\/2011\/04\/dilbert-downtime1-300x261.jpg\" alt=\"\" width=\"300\" height=\"261\"><\/a> Evaporated data, turbulence, hurricanes, storms: everyone picks their own cloud-related metaphor, precisely because the subject is cloud computing, disaster recovery and downtime in datacenters. 
We are talking about <strong>Amazon Web Services<\/strong>, <strong>Aruba<\/strong> and <strong>Google Mail<\/strong>, about clustered storage systems, and about data and services centralized in datacenters built to do exactly this job. We are also talking about human errors, planning errors and the carelessness of customers who rely completely on a provider without perhaps having read the SLA, without having correctly built their systems on the services the provider offers and, finally, without having assessed the economic and reputational damage of downtime: the famous <strong>business continuity<\/strong>.<\/p>\n<h1>AWS Outage<\/h1>\n<p>Let&#8217;s start with Amazon. As already noted in this <a style=\"font-weight: normal;\" href=\"http:\/\/blog.vmengine.net\/2011\/04\/21\/aws-us-east-dc-problems\/\">other post<\/a>, on April 21st problems in a datacenter in the us-east region (Virginia) hit the fundamental services <strong>EC2<\/strong> and <strong>RDS<\/strong>. These took down some well-known social services such as <strong>Foursquare<\/strong>, <strong>Reddit<\/strong> and <strong>Quora<\/strong>.<\/p>\n<p>But <strong>NetFlix<\/strong>, <strong>Twilio<\/strong> and others, while using resources in the same availability zone, did not suffer the same fate. <strong>Reddit<\/strong> itself came back online almost immediately thanks to a strong support contract with Amazon that guaranteed it dedicated engineers. 
Amazon published a <a href=\"http:\/\/aws.amazon.com\/message\/65648\/\" target=\"_blank\" rel=\"noopener\">summary<\/a> after the event that explains what happened and how it was handled, what was learned and what will change in the immediate future to prevent a recurrence; it also apologizes to all customers and announces service credits. Obviously, the refund is not measured against the extent of the damage suffered; that depends on the type of contract you enter into with the provider.<\/p>\n<h3>EBS storage sync problems<\/h3>\n<p>For the <strong>EC2<\/strong> service, the trigger was a subset of <strong>EBS volumes<\/strong> in one Availability Zone. They froze, unable to serve reads or writes, and the instances that depended on them were obviously blocked too. Amazon disabled the control APIs for the EBS cluster in the affected zone. In its summary, Amazon briefly explains how the EBS service and its clusters work. As many experts will already have guessed, EBS is a distributed storage system: a set of clusters that hold the data, replicated consistently at block level across nodes, plus a set of control systems that coordinate user requests to the nodes. Replication between cluster nodes relies on a peer-to-peer, fast-failover strategy that creates new replicas when one goes out of sync or becomes unavailable. The nodes are connected to each other by two networks: a high-bandwidth primary network and a lower-capacity secondary network used as a backup and as overflow capacity for data replication. 
The secondary network was not designed to carry all the traffic of the primary but to provide very reliable connectivity between the nodes.<\/p>\n<p>The problem arose from <span style=\"text-decoration: underline;\">human error<\/span> during a routine scaling activity on the primary network of an EBS cluster. The change was meant to increase its capacity, but the traffic that had to be shifted to allow the upgrade was mistakenly routed onto the secondary network, which could not cope, and replication broke down. The EBS storage problem obviously also hit the <strong>RDS<\/strong> relational database service, which depends on it entirely.<\/p>\n<p>According to an analysis by <strong>RightScale<\/strong>, more than <strong>500k<\/strong> EBS volumes may have been affected. The analysis also argues that an event of this magnitude exceeds the design parameters and cannot be tested for, and that no system of comparable scale is in operation anywhere else.<\/p>\n<p>Amazon states that it will make a series of changes to improve and to avoid a recurrence of this type of event.<\/p>\n<p>An interesting comment comes from <strong>Rackspace&#8217;s Lew Moorman<\/strong> in an interview with the <strong><a href=\"http:\/\/www.nytimes.com\/2011\/04\/23\/technology\/23cloud.html?scp=1&amp;sq=Lew%20Moorman&amp;st=cse\" target=\"_blank\" rel=\"noopener\">New York Times<\/a><\/strong>: &#8220;Amazon&#8217;s outage is the cyber equivalent of a plane crash. This is an important episode with widespread damage. But air travel is still safer than traveling by car \u2013 analogous to cloud computing being safer than data centers run by individual enterprises. 
Every day, in companies around the world, there are technology outages. Each episode is very small, but together they can waste more time, money and business.&#8221;<\/p>\n<h3>AWS Lessons and the Right Approaches to Using It<\/h3>\n<p>What can customers do to use these services correctly and ride out the provider&#8217;s technical problems? First of all, the EC2 service used simply and on its own does not guarantee high availability: it has an SLA of 99.95%. The same applies to RDS, which depends on EC2 and EBS. But Amazon itself says that correct use of its services leads to highly reliable solutions. Examples include deploying across multiple Availability Zones (<strong>NetFlix<\/strong> uses three); taking EBS snapshots, which makes it possible to recreate a volume in other Availability Zones (the <strong>snapshot<\/strong> is physically stored on S3); backing up data to S3; using RDS backups and <strong>snapshots<\/strong>; or even enabling <strong>multi-AZ<\/strong> replication (between different Availability Zones). These are the approaches that kept certain customers online despite the provider&#8217;s problems.<\/p>\n<h1><strong>Aruba<\/strong><\/h1>\n<p>According to Aruba&#8217;s statement in the following notice: <a href=\"http:\/\/ticket.aruba.it\/News\/212\/webfarm-arezzo-aggiornamenti-3.aspx\">http:\/\/ticket.aruba.it\/News\/212\/webfarm-arezzo-aggiornamenti-3.aspx<\/a><\/p>\n<p><em>This morning at 04:30, a short circuit inside the battery cabinets serving the UPS systems of Aruba&#8217;s Arezzo Server Farm caused a fire. The fire detection system went into operation immediately, first switching off the air conditioning and then activating the extinguishing system. 
As the smoke released by the burning plastic of the batteries completely filled the premises, the system interpreted the persistent smoke as a continuing fire and automatically cut off the electricity.<\/em><\/p>\n<p>The UPS system should take over when the mains power supply fails. Instead, a design error (<span style=\"text-decoration: underline;\">human error<\/span>) in the ventilation of the UPS room caused all systems to shut down, which for the Italian market meant millions of customer sites going offline. As Aruba itself says in the press release, this error will be corrected:<\/p>\n<p><em>In addition, although it is customary to install batteries inside the data center, to avoid a repeat of what happened, from today the batteries of the Arezzo data center and of all the other Aruba Group data centers will be installed in dedicated rooms, external to and separate from the main structure.<\/em><\/p>\n<h1>Google Gmail<\/h1>\n<p>In the case of the outage that hit some Gmail customers last February, as reported in <a href=\"http:\/\/static.googleusercontent.com\/external_content\/untrusted_dlcp\/www.google.com\/it\/\/appsstatus\/ir\/nfed4uv2f8xby99.pdf\">http:\/\/static.googleusercontent.com\/external_content\/untrusted_dlcp\/www.google.com\/it\/\/appsstatus\/ir\/nfed4uv2f8xby99.pdf<\/a>, the cause was a bug inadvertently introduced in a software update (<span style=\"text-decoration: underline;\">human error<\/span>). To avoid data integrity issues, access to Google Apps was disabled for the affected customers, and the engineering team had to restore mailboxes from backup tapes, confirming that tape backups are still in use and still reliable.<\/p>\n<h1>Conclusions<\/h1>\n<p>Despite these incidents, as Lew Moorman says in the NYT interview, the large data centers run by these big players are still safer than the solutions that small and medium-sized companies could adopt.<\/p>\n<p>Instead, the 
discussion should shift to a very complex issue that starts from the following question:<\/p>\n<p><em><span style=\"text-decoration: underline;\">why do Facebook, Google and Amazon build their own servers (Facebook and Google specifically) and datacenters (see Facebook&#8217;s <a href=\"http:\/\/opencompute.org\/\" target=\"_blank\" rel=\"noopener\">OpenCompute<\/a> Project), and modify open source projects or create software for their own needs (see Google&#8217;s Bigtable)? Consider, for example, the S3 and EBS storage systems (which seem to use <a href=\"http:\/\/www.drbd.org\/\" target=\"_blank\" rel=\"noopener\">DRBD<\/a>) or SDB, whose beating heart is banks of classic but powerful servers and dedicated network systems that replicate data between the numerous nodes: proprietary software solutions, or modifications of open source projects born in some university and perhaps still under active development somewhere in the world.<\/span><\/em><\/p>\n<p>The question is provocative for the vendors of large systems (IBM, HP, Dell, etc.), but the answer may lie in the following older posts (<a href=\"http:\/\/blog.vmengine.net\/2008\/07\/08\/isilon-contatti-e-valutazione-offerta-commerciale\/\" target=\"_blank\" rel=\"noopener\">Isilon Technology<\/a>, <a href=\"http:\/\/blog.vmengine.net\/2010\/11\/17\/emc-compra-isilon\/\" target=\"_blank\" rel=\"noopener\">EMC buys Isilon<\/a>, <a href=\"http:\/\/blog.vmengine.net\/2008\/10\/13\/hplefthand-acquisizione\/\" target=\"_blank\" rel=\"noopener\">HP buys LeftHand<\/a>, and other recent acquisitions), i.e. 
the big vendors began only a few years ago to understand the need to specialize in clustered distributed storage systems, because only such systems can cope with large volumes of data and heavy demands for bandwidth and concurrent access. In addition, the enormous needs of Amazon, Google and Facebook tip the economic balance towards open source or in-house solutions, compared with the licensing and support costs they would face with traditional vendors.<\/p>\n<p>In short, most of the software and hardware behind the services that we use, and will use more and more, whether or not they belong to the cloud computing paradigm, are systems that, because of their size and scope, can never be fully tested against disaster.<!--:--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We talk about evaporated data, turbulence, hurricanes, storms, each one uses its own metaphor associated with clouds, precisely because we talk about cloud computing, disaster recovery, downtime in datacenters. 
We are talking about Amazon Web Service, Aruba, Google Mail, clustered storage systems, we are talking about centralized data, centralized services in datacenters that are created [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":28754,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[97],"tags":[2027,2028,2029,2030,2031,134,1406,2032,2033,1264,135,2016,2034,2035,2018,2036,2019,1055,131,228],"class_list":["post-38328","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog-en","tag-aruba-disaster-en","tag-aruba-outage-en","tag-aws-disaster-en","tag-aws-outage-en","tag-business-continuity-en","tag-cloud-computing-en","tag-datacenter-en","tag-disaster-recovery-en","tag-downtime-en","tag-ebs-en","tag-ec2-en","tag-foursquare-en","tag-gmail-outage-en","tag-multi-az-en","tag-quora-en","tag-rds-en","tag-reddit-en","tag-snapshot-en","tag-stories-en","tag-technical-en"],"aioseo_notices":[],"jetpack_featured_media_url":"http:\/\/temp_new.vmenginelab.com\/wp-content\/uploads\/2011\/04\/dilbert-downtime1-1.jpg","amp_enabled":true,"_links":{"self":[{"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/posts\/38328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/comments?post=38328"}],"version-history":[{"count":1,"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/posts\/38328\/revisions"}],"predecessor-version":[{"id":41031,"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/posts\/38328\/revisions\/41031"}],"wp:featuredmedia":[{"embeddable":true,"h
ref":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/media\/28754"}],"wp:attachment":[{"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/media?parent=38328"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/categories?post=38328"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/temp_new.vmenginelab.com\/en\/wp-json\/wp\/v2\/tags?post=38328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}